1. INTRODUCTION



The equivalence problem is to determine the finest partition
on a set that is consistent with a sequence of assertions of the
form "x ≡ y". A strategy for doing this on a computer processes
the assertions serially, always maintaining in storage a
representation of the partition defined by the assertions
encountered so far. To process the command "x ≡ y", the
equivalence classes of x and y are determined. If they are the
same, nothing further is done; otherwise the two classes are merged.

Galler and Fischer (1964A) give an algorithm for solving this
problem based on tree structures, and it also appears in Knuth
(1968A). The items in each equivalence class are arranged in a
tree, and each item except for the root contains a pointer to its
father. The root contains a flag indicating that it is a root,
and it may also contain other information relevant to the
equivalence class as a whole.

Two operations are involved in processing a command "x ≡ y":
first we must find the classes containing x and y, and then these
classes are (possibly) merged together. The find is accomplished
by successively following the father links up the path from the
given node until the root is encountered. To merge two trees
together, the root of one is attached to the root of the other,
and the former node is marked to indicate that it is no longer a
root in the new data structure.

The time required to accomplish a find depends on the length
of the path from the given node to the root of its tree, while the
time to process a merge (given the roots of the two trees involved)
is a constant. For definiteness, we let the cost of a merge be
unity and the cost of a find be the number of nodes, including the
endpoints, on the path from the given node to the root.
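The basic algorithm and its cost measure can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name `Forest`, the helper `assert_equivalent`, and the dictionary of father links are hypothetical choices made for the example and do not come from the paper.

```python
class Forest:
    """Father-pointer forest; father[v] is None when v is a root."""

    def __init__(self, nodes):
        self.father = {v: None for v in nodes}

    def find(self, v):
        """Return (root, cost); cost counts the nodes on the path, endpoints included."""
        cost = 1
        while self.father[v] is not None:
            v = self.father[v]
            cost += 1
        return v, cost

    def merge(self, u, v):
        """Attach root u to root v; a merge has unit cost."""
        assert self.father[u] is None and self.father[v] is None
        self.father[u] = v
        return 1


def assert_equivalent(forest, x, y):
    """Process the assertion x ≡ y and return the total cost incurred."""
    rx, cx = forest.find(x)
    ry, cy = forest.find(y)
    cost = cx + cy
    if rx != ry:
        cost += forest.merge(rx, ry)
    return cost


f = Forest("abcd")
assert_equivalent(f, "a", "b")
assert_equivalent(f, "c", "d")
assert_equivalent(f, "b", "d")
print(f.find("a"))  # ('d', 3): the path a, b, d contains three nodes
```

In this run all four items end up in one class whose root is d, and the find on a pays for every node on its path.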

In this paper, we are interested in the way the cost of a
sequence of instructions grows as a function of its length. Using
the above algorithm, a sequence of n merge instructions can cause
a tree to be built with a node v of depth n, so subsequent finds
on that node will cost $n+1$ units each. The sequence consisting of
the n merge instructions followed by n copies of a find(v)
instruction will then cost $n(n+2)$, and it is easy to see that
$O(n^2)$ is an upper bound as well.
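The quadratic example can be replayed numerically. The sketch below assumes one particular way of building the deep tree (each merge attaches the current root to a fresh node); the function name `chain_cost` is made up for the illustration.

```python
def chain_cost(n):
    """Cost of n merges that build a path, then n finds on the deepest node (no collapsing)."""
    father = {0: None}
    cost = 0
    for i in range(1, n + 1):
        father[i] = None       # node i starts out as a root
        father[i - 1] = i      # merge: the old root i-1 is attached to the new root i
        cost += 1
    for _ in range(n):
        v, steps = 0, 1        # node 0 now has depth n
        while father[v] is not None:
            v = father[v]
            steps += 1
        cost += steps          # each find costs n + 1
    return cost


print([chain_cost(n) for n in (1, 2, 3, 10)])  # [3, 8, 15, 120], i.e. n(n+2)
```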

The above example suggests adding to the algorithm a "collapsing
rule", which Knuth (1972A) attributes to Tritter. Every time
a find instruction is executed, a second pass is made up the path
from the given node to the root and each node on that path is
attached directly to the root (except for the root itself). At
worst this will only double the cost of the algorithm, and it may
cause subsequent finds to be greatly speeded up. Indeed this
turns out to be the case, for in Section 4 we show that the upper
bound drops to $O(n^{3/2})$ using this heuristic.
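A find with the collapsing rule might look like the following sketch. The function name `find_collapse` and the dictionary representation are assumptions of the example; the cost returned is the number of nodes on the original path, as in the cost measure above.

```python
def find_collapse(father, v):
    """Find with the collapsing rule; returns (root, cost)."""
    path = []
    while father[v] is not None:
        path.append(v)
        v = father[v]
    root = v
    for w in path:             # second pass: attach every non-root path node to the root
        father[w] = root
    return root, len(path) + 1


n = 5
father = {i: (i + 1 if i < n else None) for i in range(n + 1)}  # path 0 -> 1 -> ... -> 5
first = find_collapse(father, 0)
second = find_collapse(father, 0)
print(first, second)  # (5, 6) (5, 2): the second find on the same node is cheap
```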

Another heuristic, the "weighting rule", was studied by
Hopcroft and Ullman (1971A) and previously known to several others.
When performing a merge, an attempt is made to keep the trees
balanced by always attaching the tree with the smaller number of
nodes to the root of the tree with the larger number. To do this
efficiently requires that extra storage be associated with each
root in which to record the number of nodes in its tree. Hopcroft
and Ullman show that with the weighting, a tree of n nodes can
have height at most $\log n$, and it follows that an instruction
sequence of length $n > 1$ can have cost no greater than
$O(n \log n)$. Moreover, in the absence of the collapsing rule, it
is easy to construct instruction sequences whose cost does grow as
$n \log n$.
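The height bound for the weighting rule can be observed on a random example. This sketch assumes ties are broken arbitrarily and uses hypothetical names (`weighted_merge`, `random_weighted_tree`); it checks a single random tree and is not a proof.

```python
import math
import random


def weighted_merge(father, size, u, v):
    """Attach the root of the smaller tree to the root of the larger; return the new root."""
    if size[u] > size[v]:
        u, v = v, u                    # now u roots the tree with no more nodes than v's
    father[u] = v
    size[v] += size[u]
    return v


def depth(father, v):
    d = 0
    while father[v] is not None:
        v = father[v]
        d += 1
    return d


def random_weighted_tree(n, seed=0):
    """Merge n singleton trees into one tree using the weighting rule."""
    rng = random.Random(seed)
    father = {i: None for i in range(n)}
    size = {i: 1 for i in range(n)}    # node counts, meaningful for roots only
    roots = list(range(n))
    while len(roots) > 1:
        u, v = rng.sample(roots, 2)
        survivor = weighted_merge(father, size, u, v)
        roots.remove(u if survivor == v else v)
    return father


n = 64
father = random_weighted_tree(n)
height = max(depth(father, v) for v in father)
print(height, math.log2(n))            # the height never exceeds log2(n) = 6
```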

Combining the collapsing rule with the weighting rule yields
an algorithm superior to those using either heuristic alone. With
only the collapsing rule, we exhibit in Section 3 sequences whose
cost grows proportionally to $n \log n$, where n is the length of the
sequence, and as we remarked above, a similar lower bound holds
for the weighting rule alone. Combining both heuristics, we
derive in Section 4 an $O(n \log\log n)$ upper bound. Hopcroft and
Ullman (1971A) claim that the upper bound is actually linear.
However, we have a counterexample to one of their earlier lemmas,
and although this difficulty can be overcome, we are unable to
follow the final part of their argument.



2. THE ALGORITHMS

An equivalence program over the set $\Sigma$ is any sequence of
instructions of the form find(a), where a is an element of $\Sigma$, or
merge(A,B,C), where A, B and C are names of equivalence classes.
(Cf. Hopcroft and Ullman (1971A).) Find(a) returns the name of
the equivalence class of which a is a member, and merge(A,B,C)
combines classes A and B into a single new class C.

We now consider two algorithms which can be used to implement
equivalence programs. We first need some notation.

A forest F is a set of oriented (unordered) trees over some
set V(F) of nodes. If v is a node, then depth[F](v) is the length
of the path in F from v to a root, and height[F](v) is the maximum
length of a path in F from v to a leaf. The depth and height will
be written simply depth(v) and height(v) when the forest F is
understood. The height of a tree A, height(A), is the height of
its root.

The algorithms are built from three kinds of instructions
which operate on a forest F. If v is a node, then find(v) does
the following:

1. If v is a root, or if father(v) is a root, then F is left
unchanged.

2. Otherwise, let $v = v_0, v_1, \ldots, v_k$ be the (unique) path
from v to the root $v_k$. Then F is modified by making $v_k$ the
father of each of the nodes $v_0, \ldots, v_{k-2}$.

The cost of find(v) is 1 + depth(v).

The instruction U-merge(u,v) has unit cost and is defined
only when u and v are both roots. It causes the node u to become
a direct descendant of v (and hence u is no longer a root).

For any node v, let weight(v) be the number of nodes in the
subtree rooted by v (and including v itself). The instruction
W-merge(u,v) also has unit cost and is defined only when u and v
are both roots. If weight(u) < weight(v), it behaves exactly like
U-merge(u,v); otherwise, it causes the node v to become a direct
descendant of u.

We define a U-program to be any sequence of instructions
consisting solely of finds and U-merges. Similarly, a W-program is
any sequence of finds and W-merges.

Let $\alpha$ be a U- (W-)program. Then $T(\alpha)$ is the total cost of
executing the instructions of $\alpha$ in sequence, starting from an
initial forest $F_0$ in which every node is a root. $T(\alpha)$ is
undefined if any of the instructions in $\alpha$ is undefined.
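The definitions of this section can be turned into a small interpreter that computes the total cost of a program. This is a sketch under assumptions: the tuple encoding of instructions and the name `run_program` are invented for the example, and the tie-breaking in W-merge follows the definition as it reads in this text.

```python
def run_program(program, nodes):
    """Return T(program), starting from the forest F_0 in which every node is a root.

    Instructions are tuples: ("find", v), ("U-merge", u, v) or ("W-merge", u, v).
    """
    father = {v: None for v in nodes}
    weight = {v: 1 for v in nodes}     # kept up to date for roots only
    total = 0
    for op, *args in program:
        if op == "find":
            path, v = [], args[0]
            while father[v] is not None:
                path.append(v)
                v = father[v]
            for w in path[:-1]:        # v_0 .. v_{k-2}; v_{k-1} is already a son of the root
                father[w] = v
            total += len(path) + 1     # cost is 1 + depth
        else:
            u, v = args
            if father[u] is not None or father[v] is not None:
                raise ValueError("merge is defined only on roots")
            if op == "W-merge" and weight[u] >= weight[v]:
                u, v = v, u            # v becomes a direct descendant of u
            father[u] = v
            weight[v] += weight[u]
            total += 1
    return total


u_prog = [("U-merge", "a", "b"), ("U-merge", "b", "c"), ("find", "a"), ("find", "a")]
w_prog = [("W-merge", "a", "b"), ("W-merge", "a", "c"), ("find", "c")]
print(run_program(u_prog, "abc"), run_program(w_prog, "abc"))  # 7 4
```

In the U-program the first find(a) costs 3 and collapses the path, so the second costs 2; with the two unit-cost merges the total is 7.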



3. A LOWER BOUND FOR THE COST OF THE UNWEIGHTED ALGORITHM 

In this section, we show how to find, for each n > 0, a
U-program $\alpha$ of length n such that $T(\alpha) \ge cn(\log n)$ for some
constant c independent of n.

We begin by defining inductively for each n a class $S_n$ of
trees:

(i) Any tree consisting of just a single node is an $S_0$ tree.

(ii) Let A and B be $S_{n-1}$ trees, and assume that A and B have
no nodes in common. Then the tree obtained by attaching the root
of A to the root of B is an $S_n$ tree.

Figure 3.1 illustrates the building of an $S_n$ tree, and Figure 3.2
shows an example of such a tree.

Lemma 3.1. Let A be an $S_n$ tree. Then A has $2^n$ nodes,
height(A) = n, and A contains a unique node of depth n.

Proof. Trivial induction on n. □
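The inductive definition and the three claims of Lemma 3.1 can be checked mechanically for small n. In this sketch the function name `build_S` and the use of integer node labels are assumptions of the example.

```python
from itertools import count


def build_S(n, father, fresh):
    """Build an S_n tree on fresh nodes, record father links, and return its root."""
    if n == 0:
        v = next(fresh)
        father[v] = None
        return v
    a = build_S(n - 1, father, fresh)
    b = build_S(n - 1, father, fresh)
    father[a] = b              # attach the root of A to the root of B
    return b


def depth(father, v):
    d = 0
    while father[v] is not None:
        v = father[v]
        d += 1
    return d


n = 4
father = {}
root = build_S(n, father, count())
depths = [depth(father, v) for v in father]
print(len(father), max(depths), depths.count(n))  # 16 4 1
```

The three printed numbers are the node count, the height, and the number of nodes of depth n.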




Figure 3.1. Definition of an $S_n$ tree.



In light of the lemma, we define the handle of an $S_n$ tree to
be the unique node of depth n.

Two alternate characterizations of $S_n$ trees are illustrated
in Figure 3.3 and stated in:

Lemma 3.2. Let A be an $S_n$ tree with handle v.

(a) There exist disjoint trees $A_0, \ldots, A_{n-1}$, not containing v,
with roots $a_0, \ldots, a_{n-1}$ respectively, such that (1) $A_i$ is an
$S_i$ tree, $0 \le i \le n-1$, and (2) A is the result of attaching v to
$a_0$ and $a_i$ to $a_{i+1}$ for each i, $0 \le i < n-1$.

(b) There exist disjoint trees $A'_0, \ldots, A'_{n-1}$ with roots
$a'_0, \ldots, a'_{n-1}$ respectively and a node u not in any $A'_i$ such
that (1) $A'_i$ is an $S_i$ tree, $0 \le i \le n-1$, and (2) A is the
result of attaching $a'_i$ to u for each i, $0 \le i \le n-1$. Moreover,
v is the handle of $A'_{n-1}$.

Proof. Again the proof is a trivial induction on n and is
omitted. □




Figure 3.2. An S tree.




Figure 3.3. Decompositions of an $S_n$ tree A.



The remarkable property of an $S_n$ tree is that it is
self-reproducing in the sense that if an $S_n$ tree A is embedded in a
larger tree B so that the root of A has depth > 0 in B, then a
find on the handle of A (which collapses the path above the handle)
costs at least $n+2$ and the resulting tree still has an $S_n$ tree
embedded in it!

We now make these notions more precise.

Definition. Let A and B be trees. A one-one function
$\eta : V(A) \to V(B)$ is an embedding of A in B if for all $u, v \in V(A)$,
u = father(v) iff $\eta(u)$ = father($\eta(v)$). $\eta$ is initial (proper) if $\eta$
maps (does not map) the root of A onto the root of B. We say that
A is initially (properly) embeddable in B if there exists an
initial (proper) embedding of A in B.

Lemma 3.3. Let A be an $S_n$ tree with handle v, and assume $\eta$
is a proper embedding of A in a tree P. Then A' is initially
embeddable in the tree P', where A' is an $S_n$ tree and P' results
from the instruction find($\eta(v)$) on P.

Proof. The trees described below are illustrated in Figure 
3.4. 

Let A be an $S_n$ tree with handle v, and assume $\eta$ is a proper
embedding of A in P. By Lemma 3.2(a), we may assume that
$v, a_0, \ldots, a_{n-1}$ is the path from v to the root of A, and
$a_0, \ldots, a_{n-1}$ are the roots of disjoint subtrees $A_0, \ldots, A_{n-1}$
respectively, where each $A_i$ is an $S_i$ tree, $0 \le i \le n-1$.

For each i, $0 \le i \le n-1$, let $P_i$ be the subtree of P consisting of
the nodes in $\{\eta(u) \mid u \in V(A_i)\}$.

Let A' be the tree formed as in Lemma 3.2(b) by linking each
of the nodes $a_i$ to a new node a'. Then A' is an $S_n$ tree.

Let P' result from the execution of the instruction find($\eta(v)$)
on P, and let $\rho$ be the root of P'.

Finally, define a mapping $\eta'$ from the nodes of A' to the
nodes of P' by

$$\eta'(u) = \begin{cases} \eta(u) & \text{if } u \in V(A_i) \text{ for some } i,\ 0 \le i \le n-1, \\ \rho & \text{if } u = a'. \end{cases}$$

Figure 3.4. Trees in the proof of Lemma 3.3.



It remains to show that $\eta'$ is an initial embedding of A' in
P'.

Let $\pi$ be the path from $\eta(v)$ to the root of P. From the
definition of embedding, each of the nodes $\eta(v), \eta(a_0), \ldots, \eta(a_{n-1})$
appears on $\pi$, and no node in $P_i$ except for $\eta(a_i)$ is in $\pi$,
$0 \le i \le n-1$.

As a consequence of the find, each of the nodes $\eta(a_i)$ is
linked directly to the root $\rho$ of P', and since the path $\pi$ did not
run through any nodes of $P_i$ except for the root, $P_i$ is a subtree
of P' linked directly to $\rho$. It is easily verified that $\eta'$ is an
initial embedding of A' in P'. □

We now construct a costly U-program. First build an $S_k$ tree.
Then alternately "push" it down by merging it to a new node, and
perform a find on the handle. This find costs $k+2$ units and it
leaves us with a new tree in which an $S_k$ tree is initially
embedded. Thus we can repeat the "merge, find" sequence as often as
we wish, yielding an average instruction time that approaches
$(k+3)/2$. Since we can do this for arbitrary k, the cost of
U-programs cannot be linear in their length. In fact, we show:

Theorem 1. For any n > 0, there exists a U-program $\alpha$ of length
n such that $T(\alpha) \ge cn(\log n)$ for some constant c independent of n.

Proof. Let $a_1, a_2, \ldots$ be a sequence of distinct nodes, and
let $\beta$ be a program of $2^k - 1$ U-merges which builds an $S_k$ tree out of
the nodes $a_1, \ldots, a_{2^k}$. For each $i \ge 1$, let $v_i$ be the handle and $r_i$
the root of the tree that results from the sequence $\beta, \gamma_1, \ldots, \gamma_{i-1}$,
and define $\gamma_i$ = "U-merge($r_i$, $a_{2^k+i}$), find($v_i$)". Let $\alpha$ be the
sequence $\beta, \gamma_1, \ldots, \gamma_m$, where $m = 2^k - 1$. Then
$T(\alpha) = (2^k - 1) + m(k+3)$, and the length of $\alpha$ is $n = 3m$, so

$$T(\alpha) = \frac{n}{3} + \frac{n(k+3)}{3} \ge cn(\log n) \tag{3.1}$$

for some constant c.

For n not of the form $3(2^k - 1)$, we form the next shorter
sequence that is of that form and then extend it arbitrarily to
get a sequence of length exactly n. This will have the effect
only of changing the constant in (3.1). □
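The costly program can be simulated to confirm the cost count in the proof. The sketch below is an illustration under assumptions, not the paper's construction verbatim: it builds the initial tree directly in the form of Lemma 3.2(b), and the `kids` lists are bookkeeping added only to locate the handle of the currently embedded tree (they are not part of the algorithm). The names `Node`, `build`, `find` and `costly_program` are hypothetical.

```python
class Node:
    def __init__(self):
        self.father = None
        self.kids = []                   # kids[i] roots an S_i tree (form of Lemma 3.2(b))


def build(j):
    """Return the root of an S_j tree; each link corresponds to one U-merge."""
    root = Node()
    for i in range(j):
        child = build(i)
        child.father = root
        root.kids.append(child)
    return root


def find(v):
    """Collapsing find as defined in Section 2; returns its cost 1 + depth(v)."""
    path = []
    while v.father is not None:
        path.append(v)
        v = v.father
    for w in path[:-1]:
        w.father = v
    return len(path) + 1


def costly_program(k):
    """Return (length, cost) of the program beta, gamma_1, ..., gamma_m, m = 2**k - 1."""
    m = 2 ** k - 1
    root = build(k)
    length = cost = m                    # beta: 2**k - 1 unit-cost U-merges
    for _ in range(m):
        new_root = Node()
        root.father = new_root           # U-merge(r_i, fresh node)
        spine = [root]                   # root, then the path down to the handle
        while spine[-1].kids:
            spine.append(spine[-1].kids.pop())
        cost += 1 + find(spine[-1])      # the find on the handle costs k + 2
        length += 2
        new_root.kids = spine[-2::-1]    # pieces of the newly embedded S_k tree
        root = new_root
    return length, cost


print(costly_program(3))  # (21, 49): n = 3m and T = m(k+4) with m = 7, k = 3
```

Because `find` walks the actual father links, the check that every round costs k+3 units does not depend on the bookkeeping being trusted blindly: a wrong handle would show up as a different total.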



4. UPPER BOUNDS 

We get upper bounds on the two algorithms by considering a
slight generalization of a find instruction. Find(u,v) behaves
like a find(u) where we pretend that v is the root. More
precisely, find(u,v) is defined only if v is an ancestor of u. If
that is the case, let $u = u_0, u_1, \ldots, u_k = v$ be the path from u to v.
Then find(u,v) causes each of the nodes $u_0, \ldots, u_{k-2}$ to be attached
directly to v. Its cost is defined to be $k+1$. A sequence of
generalized find and U- (W-)merge instructions is called a
generalized U- (W-)program.
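A generalized find can be sketched as below; the function name `generalized_find` and the dictionary of father links are assumptions of the example.

```python
def generalized_find(father, u, v):
    """Execute find(u, v), where v must be an ancestor of u; return the cost k + 1."""
    path = [u]
    while path[-1] != v:
        nxt = father[path[-1]]
        if nxt is None:
            raise ValueError("v is not an ancestor of u")
        path.append(nxt)
    for w in path[:-2]:        # u_0 .. u_{k-2} are attached directly to v
        father[w] = v
    return len(path)           # the path u_0 .. u_k has k + 1 nodes


father = {0: 1, 1: 2, 2: 3, 3: 4, 4: None}   # path 0 -> 1 -> 2 -> 3 -> 4
cost = generalized_find(father, 0, 3)
print(cost, father)  # 4 {0: 3, 1: 3, 2: 3, 3: 4, 4: None}
```

Calling it with v equal to the root of the tree containing u reproduces an ordinary find(u) with the same cost.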

Notation. Let F be a forest and $\alpha$ a program. Then $F{:}\alpha$ is
the forest that results from F by executing the instructions in $\alpha$.

Lemma 4.1. Let u be any node in a forest F. Then there
exists a node v in F such that F:find(u) = F:find(u,v) and the
costs of executing find(u) and find(u,v) are the same.

Proof. Choose v to be the root of the tree containing u. □

Applying Lemma 4.1 in turn to each of the find instructions
in a U- or W-program $\alpha$ gives the following:

Lemma 4.2. Let $\alpha$ be a U- (W-)program and F a forest. Then
there exists a generalized U- (W-)program $\beta$ such that $F{:}\alpha = F{:}\beta$
and $T(\alpha) = T(\beta)$.

Generalized programs are convenient to deal with because there 
is no loss of generality in restricting attention to programs in 
which all the merges precede all the finds. 

Lemma 4.3. Let F be a forest containing the nodes p, q, u
and v and let M be the instruction U-merge(p,q) (W-merge(p,q)).
Let $\alpha_1$ = "find(u,v), M" and $\alpha_2$ = "M, find(u,v)". If $\alpha_1$ is defined
on F, then $F{:}\alpha_1 = F{:}\alpha_2$ and $T(\alpha_1) = T(\alpha_2)$.

Proof. The only possible effects of M are to change the
father of p to be q, or to change the father of q to be p.
Similarly, the only possible effects of the instruction find(u,v)
are to change the fathers of the nodes on the path from u to v
(but not including the last two such nodes). Since $\alpha_1$ is defined,
v is an ancestor of u and both p and q are roots in F; hence
the sets of father links changed by the two instructions are
disjoint. Moreover, the choice of whether to link p to q or q to p
in case M is a W-merge instruction depends only on the weights of
p and q, and the weight of a root is not affected by a find
instruction. Hence, neither instruction affects the action of the
other, so $F{:}\alpha_1 = F{:}\alpha_2$ and $T(\alpha_1) = T(\alpha_2)$. □

Lemma 4.3 enables one to convert a generalized program into
an equivalent one in which all the merges precede all the finds.

Lemma 4.4. Let $\alpha$ be a generalized program, and let $\beta$ result
from $\alpha$ by moving all the merge instructions left in the sequence
before all the finds, but preserving the order of the merges and
the order of the finds. Then $F{:}\alpha = F{:}\beta$ and $T(\alpha) = T(\beta)$.

To bound the cost of a generalized U-program, we consider the
effects of a U-merge and a generalized find instruction on the
total path length of a forest F, defined to be

$$\sum_{v \in V(F)} \text{depth}(v).$$

Lemma 4.5. Let $\alpha$ be a sequence of n U-merge instructions and
let $F = F_0{:}\alpha$. Then the total path length of F is at most $n^2$.

Proof. No node in F can have depth > n, and at most n nodes
have non-zero depth. Hence, the total path length is at most $n^2$. □

Lemma 4.6. A generalized find instruction of cost $\ell > 2$
reduces the total path length by at least $(\ell-2)^2/2$.

Proof. Let find(u,v) be an instruction of cost $\ell$. Then there
is a path $u = u_0, u_1, \ldots, u_{\ell-1} = v$ from u to v. For each i,
$0 \le i \le \ell-3$, the find causes the depth of node $u_i$ to become one
plus the depth of v, so the reduction in total path length is at least

$$\sum_{i=0}^{\ell-3} \bigl(\text{depth}(u_i) - (1 + \text{depth}(v))\bigr) = \sum_{i=0}^{\ell-3} (\ell - 2 - i) = \sum_{j=1}^{\ell-2} j \ge \frac{(\ell-2)^2}{2}. \quad □$$
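The bound of the lemma can be checked on the smallest relevant case, a bare path with no other nodes hanging from it. The helper names in this sketch are made up for the illustration.

```python
def total_path_length(father):
    """Sum of the depths of all nodes in the forest."""
    total = 0
    for v in father:
        while father[v] is not None:
            v = father[v]
            total += 1
    return total


def reduction_on_path(length):
    """Apply find(u, v) along a bare path of `length` nodes; return the drop in total path length."""
    father = {i: i + 1 for i in range(length - 1)}
    father[length - 1] = None              # v = node length-1 is the root here
    before = total_path_length(father)
    for w in range(length - 2):            # u_0 .. u_{l-3} become sons of v
        father[w] = length - 1
    return before - total_path_length(father)


for l in (3, 4, 5, 10):
    print(l, reduction_on_path(l), (l - 2) ** 2 / 2)
```

On a bare path the reduction equals (l-1)(l-2)/2, which is never smaller than (l-2)^2/2; extra descendants of the moved nodes can only increase it.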
1*0 x 1-0 j=l 

Theorem 2. Let $\alpha$ be a U-program of length n. Then
$T(\alpha) \le cn^{3/2}$ for some constant c independent of n.

Proof. By Lemma 4.2, it suffices to bound a generalized
U-program $\alpha$ instead, and by Lemma 4.4, we may assume that all the
U-merges in $\alpha$ precede all the finds.

A program of length n clearly has at most n merge instructions
and at most n find instructions. Let $\ell_i$ be the cost of the i-th
find instruction if there is one and 0 if not. Clearly,

$$T(\alpha) \le n + \sum_{i=1}^{n} \ell_i. \tag{4.1}$$

By Lemma 4.5, the forest after executing the merge
instructions in $\alpha$ can have a total path length of at most $n^2$. Only the
find instructions of cost greater than two affect the tree, so let
$I = \{i \mid \ell_i > 2\}$. If $i \in I$, Lemma 4.6 asserts that the i-th find
instruction decreases the total path length by at least $(\ell_i-2)^2/2$.
The total path length at the end cannot be negative, so

$$n^2 \ge \frac{1}{2} \sum_{i \in I} (\ell_i - 2)^2 \ge \frac{1}{2} \sum_{i=1}^{n} (\ell_i - 2)^2 - 2n \tag{4.2}$$

or

$$6n^2 \ge \sum_{i=1}^{n} (\ell_i - 2)^2. \tag{4.3}$$

The maximum value for $\sum_{i=1}^{n} \ell_i$ is achieved when all the
$\ell_i$'s are equal, for if they are not all the same, replacing each
by the mean $\bar\ell$ can only cause $\sum_{i=1}^{n} (\ell_i - 2)^2$ to decrease. Hence, from
(4.1) and (4.3) we get

$$T(\alpha) \le n + n\bar\ell \tag{4.4}$$

where $\bar\ell$ is subject to the constraint that

$$6n^2 \ge n(\bar\ell - 2)^2. \tag{4.5}$$

From (4.5),

$$\bar\ell \le 2 + \sqrt{6n} \tag{4.6}$$

and substituting into (4.4), we get

$$T(\alpha) \le n + n(2 + \sqrt{6n}) \le 6n^{3/2}. \quad □ \tag{4.7}$$
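The final step of the proof can also be reached with the Cauchy-Schwarz inequality. This is a standard alternative route shown only as a cross-check, not the argument used in the text; it starts from the bounds numbered (4.1) and (4.3).

```latex
\begin{align*}
\sum_{i=1}^{n} (\ell_i - 2)
  &\le \sqrt{n \sum_{i=1}^{n} (\ell_i - 2)^2}
   \le \sqrt{n \cdot 6n^2} = n\sqrt{6n}, \\
T(\alpha) &\le n + \sum_{i=1}^{n} \ell_i
   = 3n + \sum_{i=1}^{n} (\ell_i - 2)
   \le 3n + n\sqrt{6n} \le 6n^{3/2} \qquad (n \ge 1).
\end{align*}
```

The last inequality holds because $3n \le 3n^{3/2}$ and $\sqrt{6} < 3$.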

For the case of the weighted algorithm, we prove an upper
bound of $O(n \log\log n)$ using a method similar to our proof of
Theorem 2.

We say that a forest F is buildable if it can be obtained
from $F_0$ by a sequence of W-merge instructions. Buildable forests
have the important property that most nodes have low height.



Lemma 4.7 (Hopcroft and Ullman (1971A)). Let F be a buildable
forest. If v is a node in F of height h, then weight(v) $\ge 2^h$.

Proof. The result follows readily by induction on h. We
leave the details to the reader. □

Corollary. Let $\alpha$ be a sequence of W-merge instructions of
length n and let $F = F_0{:}\alpha$. For any $h \ge 0$, F contains at most
$n/2^h$ non-roots of height h.

Proof. F has exactly n non-roots, for each W-merge changes
one root to a non-root. Suppose $u_1, \ldots, u_k$ are non-roots of height
h. By the lemma, weight($u_i$) $\ge 2^h$, and all the nodes counted in
the weight of $u_i$ are non-roots, $1 \le i \le k$. Hence,

$$n \ge \sum_{i=1}^{k} \text{weight}(u_i) \ge k \cdot 2^h, \tag{4.8}$$

so $k \le n/2^h$. □
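Lemma 4.7 and its corollary can be observed on a randomly built forest. This sketch assumes the tie-breaking of W-merge as defined in Section 2 and uses hypothetical helper names; it tests one random instance and does not replace the induction.

```python
import random
from collections import Counter


def random_buildable_forest(n_merges, seed=0):
    """Apply n_merges random W-merges to F_0 on n_merges + 1 nodes."""
    rng = random.Random(seed)
    nodes = range(n_merges + 1)
    father = {v: None for v in nodes}
    weight = {v: 1 for v in nodes}     # with merges only, a weight is final once v is a non-root
    roots = list(nodes)
    for _ in range(n_merges):
        u, v = rng.sample(roots, 2)
        if weight[u] >= weight[v]:
            u, v = v, u                # the root of smaller weight goes underneath
        father[u] = v
        weight[v] += weight[u]
        roots.remove(u)
    return father, weight


def heights(father):
    """height[v] = length of the longest path from v down to a leaf."""
    height = {v: 0 for v in father}
    for v in father:
        d, w = 0, v
        while father[w] is not None:
            w = father[w]
            d += 1
            height[w] = max(height[w], d)
    return height


n = 200
father, weight = random_buildable_forest(n)
h = heights(father)
non_roots_by_height = Counter(h[v] for v in father if father[v] is not None)
print(sorted(non_roots_by_height.items()))
```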

Instead of looking at total path length, we consider a
quantity Q(F,G) which depends on two forests F and G. Our interest is
in the case where F is a buildable forest and G results from F by
a sequence of generalized finds, although our definition applies
whenever V(F) = V(G):

$$Q(F,G) = \sum_{v \in V(F)} \text{depth}[G](v) \cdot 2^{\text{height}[F](v)}. \tag{4.9}$$

Lemma 4.8. Let $\alpha$ be a sequence of W-merge instructions of
length $n \ge 1$ and let $F = F_0{:}\alpha$. Then $Q(F,F) \le n(\log(n+1))^2$.

Proof. No tree in F can have more than $n+1$ nodes. By Lemma
4.7, a root can have height at most $\log(n+1)$, so no node has
height or depth greater than $\log(n+1)$.

Let $N = \{v \in V(F) \mid \text{depth}[F](v) > 0\}$ be the set of non-roots
of F. From (4.9), we get

$$Q(F,F) \le \log(n+1) \cdot \sum_{v \in N} 2^{\text{height}[F](v)}. \tag{4.10}$$

We now wish to bound $R(F) = \sum_{v \in N} 2^{\text{height}[F](v)}$. Since a root
has height at most $\log(n+1)$, any node $v \in N$ has height at most
$H = \log(n+1) - 1$, so summing over the heights of nodes,

$$R(F) = \sum_{h=0}^{\lfloor H \rfloor} (\text{number of nodes in } N \text{ of height } h) \cdot 2^h. \tag{4.11}$$

By the corollary to Lemma 4.7, the number of nodes in N of height
h is at most $n/2^h$, so

$$R(F) \le \sum_{h=0}^{\lfloor H \rfloor} \frac{n}{2^h} \cdot 2^h \le (H+1)n = n \cdot \log(n+1). \tag{4.12}$$

Substituting (4.12) into (4.10) gives the desired result. □

Lemma 4.9. Let F be a buildable forest, $\varphi$ a sequence of
generalized finds, and let $G = F{:}\varphi$. If u is a descendant of v in
G and $u \ne v$, then height[F](u) < height[F](v).

Proof. It is easy to show by induction on the length of $\varphi$
that if u is a descendant of v in G, then u is also a descendant
of v in F. By the definition of height, it follows that
height[F](u) < height[F](v). □

Lemma 4.10. Let F be a buildable forest, $\varphi$ a sequence of
generalized finds, and let $G = F{:}\varphi$. Assume find(u,v) is defined
on G, has cost $\ell > 2$, and results in a forest G'. Then

$$Q(F,G) - Q(F,G') \ge 2^{\ell-3}.$$

Proof. Let $u = u_0, u_1, \ldots, u_{\ell-1} = v$ be the path from u to v in G. By
Lemma 4.9, the heights in F of the nodes in the path are monotone
increasing, and since heights are integral, height[F]($u_{\ell-3}$) $\ge \ell-3$.
The instruction find(u,v) does not increase the depth of any node
and it decreases the depth of $u_{\ell-3}$ by one, so

$$Q(F,G) - Q(F,G') \ge 2^{\text{height}[F](u_{\ell-3})} \ge 2^{\ell-3}. \quad □$$
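The quantity Q and the decrease promised by the lemma can be computed on a small example. The eight-node tree below is an assumption chosen for the illustration (it can be obtained by seven W-merges of equal-weight trees), and the function names are hypothetical.

```python
def depth(father, v):
    d = 0
    while father[v] is not None:
        v = father[v]
        d += 1
    return d


def heights(father):
    """height[v] = length of the longest path from v down to a leaf."""
    height = {v: 0 for v in father}
    for v in father:
        d, w = 0, v
        while father[w] is not None:
            w = father[w]
            d += 1
            height[w] = max(height[w], d)
    return height


def Q(height_F, father_G):
    """Q(F, G) = sum over v of depth[G](v) * 2 ** height[F](v)."""
    return sum(depth(father_G, v) * 2 ** height_F[v] for v in father_G)


F = {0: None, 1: 0, 2: 0, 3: 2, 4: 0, 5: 4, 6: 4, 7: 6}
h = heights(F)
G = dict(F)                       # no finds executed yet, so G = F
before = Q(h, G)
G[7] = 0                          # find(7, 0): the path 7, 6, 4, 0 has cost 4,
G[6] = 0                          # so nodes 7 and 6 become sons of 0
after = Q(h, G)
print(before, after)              # 18 14: a drop of 4, at least 2 ** (4 - 3)
```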

Theorem 3. Let $\alpha$ be a W-program of length $n \ge 4$. Then
$T(\alpha) \le cn(\log\log n)$ for some constant c independent of n.

Proof. By Lemmas 4.2 and 4.4, it suffices to prove the
theorem for a generalized W-program $\alpha = \mu\varphi$ of length n, where $\mu$ is
a sequence of W-merge instructions and $\varphi$ is a sequence of
generalized find instructions.

The lengths of $\mu$ and $\varphi$ are clearly both at most n. Let $\ell_i$ be
the cost of the i-th find instruction if there is one and 0 if not.
Then

$$T(\alpha) \le n + \sum_{i=1}^{n} \ell_i. \tag{4.13}$$



Now, let $F = F_0{:}\mu$. By Lemma 4.8,

$$Q(F,F) \le n(\log(n+1))^2. \tag{4.14}$$

Only find instructions of cost greater than two affect the
forest, so let $I = \{i \mid \ell_i > 2\}$ and let $G = F{:}\varphi$. By repeated use of
Lemma 4.10,

$$Q(F,F) - Q(F,G) \ge \sum_{i \in I} 2^{\ell_i - 3} \ge \Bigl(\sum_{i=1}^{n} 2^{\ell_i - 3}\Bigr) - n. \tag{4.15}$$

Since $Q(F,G) \ge 0$, we conclude from (4.14) and (4.15) that

$$n(\log(n+1))^2 \ge \sum_{i=1}^{n} 2^{\ell_i - 3} - n, \tag{4.16}$$

so

$$2n(\log(n+1))^2 \ge \sum_{i=1}^{n} 2^{\ell_i - 3}. \tag{4.17}$$

The maximum value for $\sum_{i=1}^{n} \ell_i$ is achieved when all the $\ell_i$'s are
equal, for if they are not all the same, replacing each by the
mean $\bar\ell$ can only cause $\sum_{i=1}^{n} 2^{\ell_i - 3}$ to decrease. Hence, from (4.13)
and (4.17), we get

$$T(\alpha) \le n + n\bar\ell \tag{4.18}$$

where $\bar\ell$ is subject to the constraint that

$$2n(\log(n+1))^2 \ge n \cdot 2^{\bar\ell - 3}. \tag{4.19}$$

Taking logarithms (to the base 2), we get

$$\bar\ell \le 3 + \log 2 + 2(\log\log(n+1)) \le 6(\log\log(n+1)). \tag{4.20}$$

Substituting back into (4.18) yields

$$T(\alpha) \le n + 6n(\log\log(n+1)) \le 13n(\log\log n). \quad □ \tag{4.21}$$


5. CONCLUSION 

We have considered two heuristics, the collapsing rule and
the weighting rule, which purportedly improve the basic tree-based
equivalence algorithm. Our results, together with the remarks in
the introduction, show that each heuristic does indeed improve the
worst-case behavior of the algorithm, and together they are better
than either alone.

There is still a considerable gap between the lower and upper
bounds we have been able to prove for the two algorithms employing
the collapsing rule, and we are unable to show even that the
weighted algorithm requires more than linear time. We leave as an
open problem to construct any equivalence algorithm at all which
can be proved to operate in linear time.